I. Introduction

Carbon dioxide (CO2) is a greenhouse gas and a primary driver of climate change. A greenhouse gas traps heat in the atmosphere, making it a key factor in the global warming observed in recent decades. CO2 enters the Earth’s atmosphere through the burning of fossil fuels, waste, and trees, as well as other chemical processes. In general, much of the rise in CO2 levels is associated with human activity.

The time series dataset for this project comes from the Mauna Loa Observatory in Hawaii. It measures the monthly mean carbon dioxide levels in parts per million from January 1980 until January 2018. More specifically, the units express the number of CO2 moles in the air divided by the total number of moles in the air after water vapor is removed. This is known as parts per million (ppm), a way to measure the concentration of a substance. So, each observation in the data represents the monthly mean concentration in ppm for a specific month since 1980. Some quick things to note: these measurements are subject to minimal revisions from quality control checks (which should not affect this analysis), and the values are more representative of Northern Hemisphere concentrations than of global ones.
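As a concrete illustration of the ppm unit, the snippet below computes a dry-air mole fraction scaled by one million. The quantities are hypothetical, chosen for illustration, and are not values taken from the dataset.

```r
# Hypothetical example of the ppm unit: a dry-air mole fraction scaled by 10^6.
moles_co2     <- 4.1e-4  # moles of CO2 in an air sample (made-up number)
moles_dry_air <- 1.0     # total moles of air, water vapor removed (made-up number)
ppm <- (moles_co2 / moles_dry_air) * 1e6
ppm  # 410: 410 CO2 molecules per million molecules of dry air
```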

The motivation for analyzing this particular topic is to use statistical models to understand and forecast climate change. Because CO2 is an indicator of climate change, explaining its patterns helps raise awareness of its increasing trend and the dangers it poses to life on Earth. As we dive into the data, we will break down the time series’ trend and seasonal components through various models, and use these models to forecast future behavior.


II. Results

1. Modeling and Forecasting Trend

a) Show a time-series plot of your data.

b) Does your plot in (a) suggest that the data are covariance stationary? Explain your answer.

No, the time series plot does not suggest covariance stationarity. We observe an upward trend with annual seasonal behavior, so the mean is not constant over time and there is no mean reversion. The variance, however, does appear roughly constant.
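For reference, covariance (weak) stationarity requires a constant mean, a constant variance, and autocovariances that depend only on the lag:

```latex
\mathbb{E}[y_t] = \mu, \qquad
\operatorname{Var}(y_t) = \sigma^2, \qquad
\operatorname{Cov}(y_t, y_{t-h}) = \gamma(h) \quad \text{for all } t
```

The upward trend violates the constant-mean condition, which is why the series fails to be covariance stationary despite its roughly constant variance.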

c) Plot and discuss the ACF and PACF of your data.

The autocorrelation function of the CO2 monthly means suggests strong time dependence: autocorrelation decays slowly over a wide range of lags. Assessed with domain knowledge, this observation is reasonable. The first time series plot of the monthly means showed a clear upward trend, and we know that such physical quantities change continuously (much as temperature cannot go from 70 degrees to 72 degrees without first hitting 71), so we would expect an ACF curve with these characteristics for this time series.

The partial autocorrelation function of the CO2 monthly means has a high value at one period of lag, and no evidence of time dependence past this. There is one bar outside the 95% confidence interval at a lag of 14, but we will treat it as spurious rather than meaningful. Again, this aligns with the intuition above: only the immediately previous value should affect the current one. Note that we also observe slight oscillation within the confidence interval. Together, the ACF and PACF suggest that this time series behaves like an autoregressive process.
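As a sanity check on this reading, a simulated AR(1) with a coefficient near one reproduces both signatures. This is a minimal sketch using base R's `arima.sim`, not part of the original analysis:

```r
# Simulate a persistent AR(1) and compare its ACF/PACF shapes to those
# described above for the CO2 series.
set.seed(1)
x <- arima.sim(model=list(ar=0.95), n=500)
a <- acf(x, lag.max=20, plot=FALSE)$acf    # a[1] holds lag 0
p <- pacf(x, lag.max=20, plot=FALSE)$acf   # p[1] holds lag 1
a[2] > a[11]           # expect TRUE: autocorrelation decays slowly across lags
abs(p[1]) > abs(p[2])  # expect TRUE: partial autocorrelation cuts off after lag 1
```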

d) Fit a linear and nonlinear model to your series. In one window, show both figures of the original times series plot with the respective fit.

e) For each model, plot the respective residuals vs. fitted values and discuss your observations.

The linear model’s fitted values vs. residuals plot tells us that this model roughly underpredicts, overpredicts, and then underpredicts again across the range of the series. This U-shaped pattern tells us that the time series trend does not follow a straight line, and that some curvature may better capture the general structure. The structure also suggests a seasonal factor that the linear model failed to explain. The residuals range from about -6 to 6, which comes from the fluctuating nature of the series about the overall trend. Also note the dashed line representing the mean of the residuals, essentially at zero.

The nonlinear model’s fitted values and residuals show a great improvement relative to the linear model. Horizontally, the fitted values look uniformly spread over the range of the x-axis. Vertically, the residuals show densely clustered positive points and more spread-out negative points. The U-shaped structure is gone, meaning our nonlinear model fits the overall trend much better. Still, the residuals range from about -4 to 4, meaning the nonlinear model is still failing to capture some seasonal behavior. We can conclude that the polynomial model explains the overall trend of CO2 levels significantly better than a plain linear model, but still fails to explain the seasonality around the trend.

f) For each model, plot a histogram of the residuals and discuss your observations.

From this histogram alone, we can see that the linear model’s residuals are somewhat normally distributed. Remember from the time series and fitted values vs. residuals plots that there was an underpredicted, overpredicted, underpredicted pattern. This histogram tells us that these under- and overpredictions roughly “cancel” each other out, which could deceptively suggest that the linear model is a good model. However, we know from above that it fails to capture seasonal components.

The nonlinear model’s residuals, on the other hand, are negatively skewed. While the mean is 0, we can see denser clustering of positive values and more spread-out negative values. Since residuals are the difference between observed and predicted values, this model tends to underpredict the recorded values. This could be due to the rising trend of the carbon dioxide levels mixed with seasonal components that are not explicitly captured in the nonlinear model.
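That skew could be quantified with a moment-based sample skewness, computable without extra packages. In the report one would pass `poly$residuals` (the nonlinear model's residuals from the source code below); the demonstration here uses a synthetic left-skewed sample instead:

```r
# Moment-based sample skewness: negative values indicate a longer left tail.
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Demonstration on a synthetic left-skewed sample (a negated exponential);
# for the report, one would call skewness(poly$residuals).
set.seed(42)
skewness(-rexp(1000))  # negative, reflecting the long left tail
```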

g) For each model, discuss the associated diagnostic statistics (R2, t−distribution, F−distribution, etc.)
Linear Model
## 
## Call:
## lm(formula = ppm ~ ts, data = c02)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8050 -1.5913 -0.0022  1.6397  5.8096 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.148e+03  1.988e+01  -158.4   <2e-16 ***
## ts           1.759e+00  9.943e-03   176.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.337 on 455 degrees of freedom
## Multiple R-squared:  0.9857, Adjusted R-squared:  0.9856 
## F-statistic: 3.129e+04 on 1 and 455 DF,  p-value: < 2.2e-16

The summary tells us that both the intercept and the ts variable are statistically significant, given their low p-values and high absolute t values. The adjusted R2 score is high at 98.56, meaning that the linear model performs quite well and that this particular time series is straightforward to model. The F-statistic is very large with a p-value of essentially zero, meaning at least one variable in this model is statistically significant. We note a residual standard error of 2.34, which we can probably improve upon with the nonlinear model.
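The reported t values are simply each estimate divided by its standard error; checking the ts row of the summary:

```latex
t = \frac{\hat{\beta}_{ts}}{\operatorname{SE}(\hat{\beta}_{ts})}
  = \frac{1.759}{9.943 \times 10^{-3}} \approx 176.9
```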

We can also examine the Cook’s distance and QQ normal plots of the residuals. The Cook’s distances are all very low (note the y-axis scale), indicating that there are no extreme values in our data that could influence our model. The highest value is just under 0.03. The QQ normal plot aligns with our observation from the residuals histogram: the linear model’s residuals appear somewhat normally distributed.
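For reference, Cook's distance for observation i combines its residual e_i with its leverage h_ii, scaled by the number of estimated coefficients k and the residual variance s^2:

```latex
D_i = \frac{e_i^2}{k\, s^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}
```

A common rule of thumb flags observations with D_i near 1 as influential, far above the maximum of about 0.03 observed here.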

Nonlinear Model
## 
## Call:
## lm(formula = ppm ~ ts + I(ts^2), data = c02)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9824 -1.2656  0.3246  1.2095  3.5441 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.801e+04  2.850e+03   20.35   <2e-16 ***
## ts          -5.943e+01  2.852e+00  -20.84   <2e-16 ***
## I(ts^2)      1.531e-02  7.133e-04   21.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.648 on 454 degrees of freedom
## Multiple R-squared:  0.9929, Adjusted R-squared:  0.9929 
## F-statistic: 3.168e+04 on 2 and 454 DF,  p-value: < 2.2e-16

The polynomial model has a slightly higher adjusted R2 score at 99.29, a 0.73-point improvement on the linear model. The p-values and t values of both the ts and ts^2 parameters indicate statistical significance, and the F-statistic again tells us that at least one parameter in this model is statistically significant. The residual standard error is 1.648, lower than the linear model’s, but it can still be reduced by adding a seasonal component to the model.

The Cook’s distances for the polynomial model are also low. The maximum value here is only slightly above 0.02, which is lower than for the linear model. Likewise, the polynomial model does not appear to have suffered from any influential data points. We can, however, see again that the polynomial model’s residuals are less characteristic of a normal distribution than the linear model’s, though they span a tighter range.

h) Select a trend model using AIC and one using BIC (show the values obtained from each criterion). Do the selected models agree?
##                AIC      BIC
## Linear    2076.671 2089.045
## Nonlinear 1758.682 1775.181

Both the AIC and BIC scores indicate that the nonlinear model should be selected, so the two criteria agree.
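Both criteria reward likelihood and penalize parameter count, with BIC penalizing extra parameters more heavily whenever ln n > 2 (here ln 457 ≈ 6.1):

```latex
\mathrm{AIC} = -2\ln\hat{L} + 2k, \qquad
\mathrm{BIC} = -2\ln\hat{L} + k\ln n
```

Since the quadratic term lowers both scores by roughly 300 points despite its penalty, the criteria agree on the nonlinear model.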

i) Use your preferred model to forecast h-steps (at least 16) ahead.

The forecast plot shows the nonlinear model’s monthly forecasts for the next two years. As expected, the point forecasts follow a nearly linear path that simply continues on from the fitted values. Note that the width of the prediction interval is comparable to the amplitude of the series’ oscillation.
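For context, the shaded band is the standard regression prediction interval for a new observation at regressor vector x0; the extra 1 under the square root, absent from the narrower confidence interval for the mean, is what keeps its width comparable to the residual spread:

```latex
\hat{y}_0 \pm t_{1-\alpha/2,\; n-k}\; s \sqrt{1 + x_0^{\top}(X^{\top}X)^{-1}x_0}
```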

2. Modeling and Forecasting Seasonality

a) Construct and test (by looking at the diagnostic statistics) a model with a full set of seasonal dummies.
## 
## Call:
## lm(formula = ppm ~ is_2 + is_3 + is_4 + is_5 + is_6 + is_7 + 
##     is_8 + is_9 + is_10 + is_11 + is_12, data = c02)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.949 -15.592  -1.449  16.512  38.141 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 369.3987     3.1536 117.137   <2e-16 ***
## is_2TRUE     -0.5221     4.4891  -0.116    0.907    
## is_3TRUE     -0.1035     4.4891  -0.023    0.982    
## is_4TRUE      0.2900     4.4891   0.065    0.949    
## is_5TRUE      0.3110     4.4891   0.069    0.945    
## is_6TRUE     -0.5077     4.4891  -0.113    0.910    
## is_7TRUE     -2.0811     4.4891  -0.464    0.643    
## is_8TRUE     -3.3808     4.4891  -0.753    0.452    
## is_9TRUE     -3.2919     4.4891  -0.733    0.464    
## is_10TRUE    -1.9966     4.4891  -0.445    0.657    
## is_11TRUE    -0.6769     4.4891  -0.151    0.880    
## is_12TRUE     0.1915     4.4891   0.043    0.966    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.69 on 445 degrees of freedom
## Multiple R-squared:  0.004446,   Adjusted R-squared:  -0.02016 
## F-statistic: 0.1807 on 11 and 445 DF,  p-value: 0.9985

The seasonal-only model performs badly on our time series. None of the seasonal dummies is statistically significant, and the adjusted R2 score is -0.02, which means this model fits our data worse than a plain horizontal line. This indicates that the trend component of this time series is essential to model. The only statistically significant coefficient is the intercept, at 369.4.
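Written out, the seasonal-only specification regresses ppm on eleven monthly indicators, with January absorbed into the intercept to avoid the dummy variable trap:

```latex
y_t = \beta_0 + \sum_{i=2}^{12} \gamma_i D_{i,t} + \varepsilon_t,
\qquad
D_{i,t} = \begin{cases} 1 & \text{if month}(t) = i \\ 0 & \text{otherwise} \end{cases}
```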

Here is a plot of the seasonal model’s fitted values overlaid on the observed values. We can see that the rise-and-fall pattern is represented well, but without the rising trend this model underperforms significantly. There is only a 1-2 year window where the seasonal model fits the data.

b) Plot the estimated seasonal factors and interpret your plot.

Our plot shows the large intercept at 369.4, which captures a reasonable baseline for ppm values that range from about 340 to 410. Note that the seasonal parameters themselves are much smaller, reflecting the marginal changes in CO2 concentration over the course of a year. In terms of actual patterns, from February until about June we see slight positive weights, and from July until October we observe slightly larger negative coefficients. Remember that this data is recorded in Hawaii, so it follows the pattern of the Northern Hemisphere seasons.

c) In order to improve your model, add the trend model from problem 1 to your seasonal model. We will refer to this model as the full model. For the full model, plot the respective residuals vs. fitted values and discuss your observations.

Here is the plot of the full model’s fitted values overlaid on the observed values. Just from a visual perspective, we can see a large improvement when combining both trend and seasonal components.

The residuals of the full model are more densely concentrated about the horizontal mean line at zero. Compared to the linear and nonlinear models, the residual range is considerably smaller. We can see some slight remaining structure: an oscillation between overpredicting and underpredicting the observed values, which may suggest some low-magnitude persistence or cycles in the data. Overall, the variance of the residuals is lower, and we can conclude that the full model shows significant improvement in fitting this time series.

d) Interpret the respective summary statistics including the error metrics of your full model.
## 
## Call:
## lm(formula = ppm ~ ts + I(ts^2) + is_2 + is_3 + is_4 + is_5 + 
##     is_6 + is_7 + is_8 + is_9 + is_10 + is_11 + is_12, data = c02)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3874 -0.5044 -0.1069  0.3388  1.8569 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.783e+04  1.179e+03  49.062  < 2e-16 ***
## ts          -5.925e+01  1.179e+00 -50.243  < 2e-16 ***
## I(ts^2)      1.526e-02  2.950e-04  51.736  < 2e-16 ***
## is_2TRUE     3.068e-01  1.554e-01   1.974 0.048950 *  
## is_3TRUE     5.797e-01  1.554e-01   3.731 0.000216 ***
## is_4TRUE     8.271e-01  1.554e-01   5.324 1.62e-07 ***
## is_5TRUE     7.020e-01  1.554e-01   4.518 8.01e-06 ***
## is_6TRUE    -2.631e-01  1.554e-01  -1.694 0.091056 .  
## is_7TRUE    -1.983e+00  1.554e-01 -12.764  < 2e-16 ***
## is_8TRUE    -3.430e+00  1.554e-01 -22.075  < 2e-16 ***
## is_9TRUE    -3.488e+00  1.554e-01 -22.449  < 2e-16 ***
## is_10TRUE   -2.340e+00  1.554e-01 -15.060  < 2e-16 ***
## is_11TRUE   -1.168e+00  1.554e-01  -7.515 3.18e-13 ***
## is_12TRUE   -4.468e-01  1.554e-01  -2.876 0.004224 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6816 on 443 degrees of freedom
## Multiple R-squared:  0.9988, Adjusted R-squared:  0.9988 
## F-statistic: 2.867e+04 on 13 and 443 DF,  p-value: < 2.2e-16

The full model performs the best so far. Nearly every variable is statistically significant, the exception being the dummy for June, and February only barely passes at the 95% significance level. As expected, the F-statistic tells us that at least one variable in the model is important in explaining CO2 levels. The adjusted R2 score is the best yet at 99.88, a 1.32-point increase over the linear model and a 0.59-point increase over the nonlinear model. This tells us that the polynomial trend combined with seasonal factors improved our model of this time series. The residual standard error is also at its lowest, 0.68, compared with 2.34 and 1.65 from the earlier models.
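As a consistency check on the reported value (with the displayed figures rounded), the adjusted R2 follows from the multiple R2 using n = 457 observations and k = 13 regressors:

```latex
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}
          = 1 - (1 - 0.9988)\,\frac{456}{443} \approx 0.9988
```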

The histogram of the full model’s residuals reaffirms the better overall fit. The Cook’s distances are the lowest of any of the models, and the QQ normal plot suggests that the residuals are concentrated more tightly around the mean than a normal distribution would be.

e) Use the full model to forecast h-steps (at least 16) ahead. Your forecast should include the respective prediction interval.

The forecasts from the full model look much more reasonable than the nonlinear model’s: they incorporate both the seasonal behavior of the time series and the overall upward trend. Also note that the prediction interval is narrower.


III. Conclusions and Future Work

To conclude, the monthly mean carbon dioxide level is a relatively straightforward stochastic process to model. Right away, we observed high R2 scores on the linear and nonlinear models, neither of which incorporated any seasonality. With the full model, we found statistical significance on every parameter but the June dummy at the 95% level, and achieved an adjusted R2 score of 99.88. Although the improvement over 98.56 and 99.29 seems small in number, it is significant when considering the range of values in the data.

While our full model already does a great job of fitting this time series, there is still room for improvement. Remember that in the residuals vs. fitted values plot we saw some leftover structure suggesting cycles or persistence. Intuitively, we could imagine that CO2 concentration exhibits persistence across spans of years: there could be 3-4 year periods when slightly higher concentrations are present, and likewise for lower ones. If we look back at the full model’s fitted values on the observed values, we can note a consistently underpredicted stretch between 1987 and 1991, and an overpredicted one from about 2009 to 2013. Even in our full model forecast plot right above, one may guess that the forecasts for 2018 to 2020 will also be underpredictions. For the context of this problem, the full model is already doing an adequate job, so we will leave it here. Yet in other contexts, marginal forecast errors like these could weigh more heavily on a loss function. With this said, future work on this time series could include testing ARMA models to pick up the cyclical behavior and eliminate the remaining structure in the residuals. Additionally, plotting the trajectory of the cumulative residuals could give more insight into potential persistent characteristics worth analyzing.
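The ARMA suggestion could be prototyped with base R's `arima()`. The sketch below fits an AR(2) to simulated noise as a stand-in, since the real exercise would pass `full$residuals` from the source code below:

```r
# Fit a low-order ARMA to (here, simulated) residuals, then test whether
# any serial correlation remains with a Ljung-Box test.
set.seed(144)
e <- arima.sim(model=list(ar=c(0.6, 0.2)), n=457)  # stand-in for full$residuals
fit <- arima(e, order=c(2, 0, 0))
coef(fit)  # estimated AR coefficients, expected near the simulated 0.6 and 0.2
Box.test(residuals(fit), lag=12, type='Ljung-Box')  # tests for leftover serial correlation
```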


IV. References

Dr. Pieter Tans, NOAA/ESRL (www.esrl.noaa.gov/gmd/ccgg/trends/) and Dr. Ralph Keeling, Scripps Institution of Oceanography (scrippsco2.ucsd.edu/).


V. R Source Code

# Load needed packages
library(tidyverse)
library(ggplot2); theme_set(theme_dark())
library(reshape)
library(data.table)
library(grid)
library(gridExtra)
library(zoo)
library(tis)
library(forecast)
library(cowplot)  # Plot grids


# Load data
temp_c02 <- read.table(file='/Users/noahkawasaki/Desktop/ECON 144/Week 4/c02.txt')
# Change column names
setnames(temp_c02, c('year', 'month_num', 'ts_raw', 'ppm', 'trend'))

# Create dataframe
c02 <- temp_c02 %>%
  mutate(
    # Merge month_num, 01 for generic day, and year into date type
    date=as.Date(paste(temp_c02$month_num, '01', temp_c02$year), format="%m %d %Y"),
    # Time series column for regression
    ts=seq(1980, 2018, length=nrow(temp_c02)),
    # Manually create seasonal dummies
    is_1 = month_num==1,
    is_2 = month_num==2,
    is_3 = month_num==3,
    is_4 = month_num==4,
    is_5 = month_num==5,
    is_6 = month_num==6,
    is_7 = month_num==7,
    is_8 = month_num==8,
    is_9 = month_num==9,
    is_10 = month_num==10,
    is_11 = month_num==11,
    is_12 = month_num==12
    ) %>%
  # Select needed columns
  dplyr::select(
    date,
    ts, 
    ppm,
    is_1, is_2, is_3, is_4, is_5, is_6, is_7, is_8, is_9, is_10, is_11, is_12
    )

## 1. Modeling and Forecasting Trend (5% each)
# (a) Show a time-series plot of your data.
ggplot(c02, aes(x=date, y=ppm)) +
  geom_line(color='green') +
  geom_hline(yintercept=mean(c02$ppm), color='grey', linetype='dashed') +
  ggtitle(expression('Global CO'[2]*' Monthly Means'), '1980-2018') +
  xlab('Date') +
  ylab('PPM')
  
# (b) Does your plot in (a) suggest that the data are covariance stationary? Explain your answer.

# (c) Plot and discuss the ACF and PACF of your data.
# Use ggplot to create my better looking acf and pacf plots
# The acf object
acf_ppm <- acf(c02$ppm, lag.max=20, plot=FALSE)
# Turn into a dataframe to send to ggplot
acf_ppm_df <- as.data.frame(acf_ppm$acf)
# Calculate CI with qnorm and n.used attribute of acf_ppm, which is just the number of observations in the 
# time series.
ci_acf_ppm <- qnorm((1 + 0.95)/2)/sqrt(acf_ppm$n.used)

ggplot(acf_ppm_df, aes(x=seq(nrow(acf_ppm_df)-1, from=0), y=V1)) +
  geom_col(fill='orange') +
  geom_hline(yintercept=ci_acf_ppm, linetype='dashed') +
  geom_hline(yintercept=-ci_acf_ppm, linetype='dashed') +
  geom_hline(yintercept=0) +
  ylim(-0.25, 1) +
  ggtitle('ACF', expression('CO'[2]*' Monthly Means')) +
  xlab('Lag') +
  ylab('ACF')

# Same process for PACF
pacf_ppm <- pacf(c02$ppm, lag.max=20, plot=FALSE)
pacf_ppm_df <- as.data.frame(pacf_ppm$acf)
ci_pacf_ppm <- qnorm((1 + 0.95)/2)/sqrt(pacf_ppm$n.used)

ggplot(pacf_ppm_df, aes(x=seq(nrow(pacf_ppm_df), from=1), y=V1)) +
  geom_col(fill='orange') +
  geom_hline(yintercept=ci_pacf_ppm, linetype='dashed') +
  geom_hline(yintercept=-ci_pacf_ppm, linetype='dashed') +
  geom_hline(yintercept=0) +
  ylim(-0.5, 1) +
  ggtitle('PACF', expression('CO'[2]*' Monthly Means')) +
  xlab('Lag') +
  ylab('PACF')

# (d) Fit a linear and nonlinear (e.g., polynomial, exponential, quadratic + periodic, etc.) model to your series. In one window, show both figures of the original times series plot with the respective fit.
# Create linear model
lr <- lm(ppm~ts, data=c02)

# Color mapper to map each variable to consistent color, to pass to ggplot for legend
cols2 <- c('Observed Values'='green', 'Fitted Values'='black')
ggplot(c02, aes(x=date)) +
  geom_line(aes(y=ppm, color='Observed Values')) +
  geom_line(aes(y=lr$fitted.values, color='Fitted Values'), linetype='dashed') +
  ggtitle(expression('Global CO'[2]*' Monthly Means'), 'Linear Model, 1980-2018') +
  xlab('Date') +
  ylab('PPM') +
  scale_color_manual(name="Lines", values=cols2)

# Create nonlinear model: polynomial degree 2
poly <- lm(ppm~ts+I(ts^2), data=c02)

ggplot(c02, aes(x=date)) +
  geom_line(aes(y=ppm, color='Observed Values')) +
  geom_line(aes(y=poly$fitted.values, color='Fitted Values'), linetype='dashed') +
  ggtitle(expression('Global CO'[2]*' Monthly Means'), 'Nonlinear Model, 1980-2018') +
  xlab('Date') +
  ylab('PPM') +
  scale_color_manual(name="Lines", values=cols2)

# (e) For each model, plot the respective residuals vs. fitted values and discuss your observations.
# Linear
ggplot(c02) +
  geom_line(aes(x=lr$fitted.values, y=lr$residuals), color='orange', alpha=0.9) +
  geom_point(aes(x=lr$fitted.values, y=lr$residuals), color='orange', alpha=0.9) +
  geom_hline(yintercept=mean(lr$residuals), color='black', linetype='dashed') +
  ylim(-6,6) +
  ggtitle('Fitted Values vs Residuals', 'Linear Model') +
  xlab('Fitted Values') +
  ylab('Residuals') 

# Nonlinear
ggplot(c02) +
  geom_line(aes(x=poly$fitted.values, y=poly$residuals), color='orange', alpha=0.9) +
  geom_point(aes(x=poly$fitted.values, y=poly$residuals), color='orange', alpha=0.9) +
  geom_hline(yintercept=mean(poly$residuals), color='black', linetype='dashed') +
  ylim(-6,6) +
  ggtitle('Fitted Values vs Residuals', 'Nonlinear Model') +
  xlab('Fitted Values') +
  ylab('Residuals') 

# (f) For each model, plot a histogram of the residuals and discuss your observations.
# Linear
ggplot(c02) +
  geom_histogram(aes(x=lr$residuals), bins=40, fill='orange', color='black', alpha=1) +
  xlim(-6,6) +
  ggtitle('Histogram of Residuals', 'Linear Model') +
  xlab('Value') +
  ylab('Frequency') 

# Nonlinear
ggplot(c02) +
  geom_histogram(aes(x=poly$residuals), bins=40, fill='orange', color='black', alpha=1) +
  xlim(-6,6) +
  ggtitle('Histogram of Residuals', 'Nonlinear Model') +
  xlab('Value') +
  ylab('Frequency') 

# (g) For each model, discuss the associated diagnostic statistics (R2, t−distribution, F−distribution, etc.)
# Linear
summary(lr)
# Create Cook's Distance object
cooks_lr <- cooks.distance(lr)
# Plot for Cook's Distance
p1 <- ggplot(c02, aes(x=seq(length(cooks_lr)), y=cooks_lr)) +
  geom_line(color='black', size=0.3, alpha=0.8) +
  geom_point(color='orange', size=0.7) +
  ylim(0, 0.03) +
  ggtitle("Cook's Distance", 'Linear Model') +
  xlab('Index') +
  ylab("Cook's Distance")
# Plot for QQ Normal plot
p2 <- ggplot(data=c02) +
  geom_qq(aes(sample=lr$residuals), color='orange', alpha=0.6) +
  ylim(-6,6) +
  ggtitle('QQ Normal Plot of Residuals', 'Linear Model') +
  ylab('Sample Quantiles') +
  xlab('Theoretical Quantiles')
# Combine into a plotgrid with 2 columns, 1 row
plot_grid(p1, p2, nrow=1)

# Nonlinear
summary(poly)
# Same process for nonlinear model
cooks_poly <- cooks.distance(poly)
p3 <- ggplot(c02, aes(x=seq(length(cooks_poly)), y=cooks_poly)) +
  geom_line(color='black', size=0.3, alpha=0.8) +
  geom_point(color='orange', size=0.7) +
  ylim(0, 0.03) +
  ggtitle("Cook's Distance", 'Nonlinear Model') +
  xlab('Index') +
  ylab("Cook's Distance")

p4 <- ggplot(data=c02) +
  geom_qq(aes(sample=poly$residuals), color='orange', alpha=0.6) +
  ylim(-6,6) +
  ggtitle('QQ Normal Plot of Residuals', 'Nonlinear Model') +
  ylab('Sample Quantiles') +
  xlab('Theoretical Quantiles')

plot_grid(p3, p4, nrow=1)
# (h) Select a trend model using AIC and one using BIC (show the values obtained from each criterion). Do the selected models agree?
# Get AIC and BIC scores
aic <- AIC(lr, poly)
bic <- BIC(lr, poly)
# Set up dataframe with scores, add row names
temp_scores <- data.frame(aic[,2], bic[,2], row.names=c('Linear', 'Nonlinear'))
# Change column names
scores <- setnames(temp_scores, c('AIC', 'BIC'))
# Print data
scores

# (i) Use your preferred model to forecast h-steps (at least 16) ahead. Your forecast should include the respective uncertainty prediction interval. Depending on your data, h will be in days, months, years, etc.
# Simulated monthly forcast horizon Feb 2018 - Jan 2020. Minimum and maximum forecast dates
fmin <- parse_date('2018-02-01', '%Y-%m-%d')
fmax <- parse_date('2020-01-01', '%Y-%m-%d')

# Create dataframe with dates, and time series sequence for regression
forecasts <- data.frame(
  ts = seq(2018.1, 2020, length=24),
  date = seq(from=fmin, to=fmax, by='month')
  ) %>%
  # Add seasonal dummy variables for the full model to use later.
  mutate(
    is_1 = month(date)==1,  # Will return TRUE on condition, FALSE otherwise.
    is_2 = month(date)==2,
    is_3 = month(date)==3,
    is_4 = month(date)==4,
    is_5 = month(date)==5,
    is_6 = month(date)==6,
    is_7 = month(date)==7,
    is_8 = month(date)==8,
    is_9 = month(date)==9,
    is_10 = month(date)==10,
    is_11 = month(date)==11,
    is_12 = month(date)==12
  )

# Make predictions for nonlinear model on forecasts dataframe
poly_pred <- predict(poly, newdata=forecasts, se.fit=TRUE)

# Get prediction and confidence intervals
poly_c_int = as.data.frame(predict(poly, forecasts, level=0.95, interval="confidence"))
poly_p_int = as.data.frame(predict(poly, forecasts, level=0.95, interval="prediction"))

# To plot forecasts with original data in ggplot, i will stack the forecasts and c02 dataframes into one dataframe. I will also add the nonlinear models fitted values. The overall number of rows will be 457 original points + 24 forecasts = 481, so I will create dummy NA vectors to pad space on respective variables so each variable vector is 481 in length.
na_457 <- rep(NA, length.out=457)
na_24 <- rep(NA, length.out=24)

# Create dataframe with all original data and nonlinear forecast data. Using the NA vectors from above, each of these vectors will be 481 in length. The observed variables will be defined until the forecast horizon, where they will have NA's. The forecast values will be NA's until the forecast start, and then will be defined values.
poly_df <- data.frame(
    date = c(c02$date, forecasts$date),
    ts = c(c02$ts, forecasts$ts),
    pred = c(na_457, poly_pred$fit),
    p_upr = c(na_457, poly_p_int$upr),
    p_lwr = c(na_457, poly_p_int$lwr),
    c_upr = c(na_457, poly_c_int$upr),
    c_lwr = c(na_457, poly_c_int$lwr),
    ppm = c(c02$ppm, na_24),
    fits = c(poly$fitted.values, na_24)
  )
# This is now a dataframe with the ts series, date, point predictions, and prediction/confidence upper and lower bounds. Pass to ggplot.

# Subset from 2010-2020 to zoom in on more recent data.
poly_subset <- poly_df[361:481,]

# Color mapper for each variable to consistent colors. Pass the ggplot for legend.
cols <- c('Prediction Interval'='darkgray', 'Confidence Interval'='black', 'Point Forecasts'='#3796db', 
          'Observed Values'='green', 'Fitted Values'='black')
ggplot(poly_subset, aes(x=date)) +
  geom_line(aes(y=ppm, color='Observed Values'), na.rm=TRUE) +
  geom_line(aes(y=fits, color='Fitted Values'), linetype='dashed', na.rm=TRUE) +
  geom_ribbon(aes(ymin=p_lwr, ymax=p_upr, fill='Prediction Interval'), alpha=1, na.rm=TRUE) +
  geom_ribbon(aes(ymin=c_lwr, ymax=c_upr, fill='Confidence Interval'), alpha=0.8, na.rm=TRUE) +
  geom_line(aes(y=pred, color='Point Forecasts'), lwd=0.6, na.rm=TRUE) +
  ggtitle(expression('Forecasted CO'[2]*' Monthly Means'), '2010-2020') +
  xlab('Date') +
  ylab('PPM') +
  scale_fill_manual(name="Shades", values=cols) +
  scale_color_manual(name="Lines", values=cols)



## 2. Modeling and Forecasting Seasonality (6% each)
# (a) Construct and test (by looking at the diagnostic statistics) a model with a full set of seasonal dummies.
# Create seasonal model
seasons <- lm(ppm~is_2+is_3+is_4+is_5+is_6+is_7+is_8+is_9+is_10+is_11+is_12, data=c02)
summary(seasons)

ggplot(c02, aes(x=date)) +
  geom_line(aes(y=ppm, color='Observed Values')) +
  geom_line(aes(y=seasons$fitted.values, color='Fitted Values'), linetype='dashed') +
  ggtitle(expression('Global CO'[2]*' Monthly Means'), 'Seasonal Model, 1980-2018') +
  xlab('Date') +
  ylab('PPM') + 
  scale_color_manual(name="Lines", values=cols2)
 
# (b) Plot the estimated seasonal factors and interpret your plot.
# Vector for ordering season coefficients correctly
vars <- c('Intercept', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12')
# Make dataframe of season coefficient values (vars must be defined first)
d <- data.frame(seasons$coef, vars)
# Factor vars column with correct level order
d$vars <- factor(d$vars, levels=vars)

ggplot(d, aes(x=vars, y=seasons$coef)) + 
  geom_col(fill='orange') +
  ggtitle('Seasonal Factor Coefficients') +
  xlab('Season') +
  ylab('Coefficient')   

# (c) In order to improve your model, add the trend model from problem 1 to your seasonal model. We will refer to this model as the full model. For the full model, plot the respective residuals vs. fitted values and discuss your observations.
# Create full model. Combine the polynomial features with seasonal binaries.
full <- lm(ppm~ts+I(ts^2)+is_2+is_3+is_4+is_5+is_6+is_7+is_8+is_9+is_10+is_11+is_12, data=c02)

ggplot(c02, aes(x=date)) +
  geom_line(aes(y=ppm, color='Observed Values')) +
  geom_line(aes(y=full$fitted.values, color='Fitted Values'), linetype='dashed') +
  ggtitle(expression('Global CO'[2]*' Monthly Means'), 'Full Model, 1980-2018') +
  xlab('Date') +
  ylab('PPM') + 
  scale_color_manual(name="Lines", values=cols2)

ggplot(c02) +
  geom_line(aes(x=full$fitted.values, y=full$residuals), color='orange', alpha=0.9) +
  geom_point(aes(x=full$fitted.values, y=full$residuals), color='orange', alpha=0.7) +
  geom_hline(yintercept=mean(full$residuals), color='black', linetype='dashed') +
  ylim(-6,6) +
  ggtitle('Fitted Values vs Residuals', 'Full Model') +
  xlab('Fitted Values') +
  ylab('Residuals') 

ggplot(c02) +
  geom_histogram(aes(x=full$residuals), bins=40, fill='orange', color='black', alpha=1) +
  xlim(-6,6) +
  ggtitle('Histogram of Residuals', 'Full Model') +
  xlab('Value') +
  ylab('Frequency') 



# (d) Interpret the respective summary statistics including the error metrics of your full model.
summary(full)

cooks_full <- cooks.distance(full)
p5 <- ggplot(c02, aes(x=seq(length(cooks_full)), y=cooks_full)) +
  geom_line(color='black', size=0.3, alpha=0.8) +
  geom_point(color='orange', size=0.7) +
  ylim(0, 0.03) +
  ggtitle("Cook's Distance", 'Full Model') +
  xlab('Index') +
  ylab("Cook's Distance")

p6 <- ggplot(data=c02) +
  geom_qq(aes(sample=full$residuals), color='orange', alpha=0.6) +
  ggtitle('QQ Normal Plot of Residuals', 'Full Model') +
  ylim(-6,6) +
  ylab('Sample Quantiles') +
  xlab('Theoretical Quantiles')

plot_grid(p5, p6, nrow=1)

# (e) Use the full model to forecast h-steps (at least 16) ahead. 
#     Your forecast should include the respective prediction interval.
# Make predictions for full model.
full_pred <- predict(full, newdata=forecasts, se.fit=TRUE)

# Get prediction and confidence intervals
full_c_int = as.data.frame(predict(full, forecasts, level=0.95, interval="confidence"))
full_p_int = as.data.frame(predict(full, forecasts, level=0.95, interval="prediction"))

# Append each variable to the NA padding vectors. Now they are all 481 in length.
full_df <- data.frame(
  date = c(c02$date, forecasts$date),
  ts = c(c02$ts, forecasts$ts),
  pred = c(na_457, full_pred$fit),
  p_upr = c(na_457, full_p_int$upr),
  p_lwr = c(na_457, full_p_int$lwr),
  c_upr = c(na_457, full_c_int$upr),
  c_lwr = c(na_457, full_c_int$lwr),
  ppm = c(c02$ppm, na_24),
  fits = c(full$fitted.values, na_24)
)

# Subset from 2010-2020 to zoom in on most recent data.
full_subset <- full_df[361:481, ]

ggplot(full_subset, aes(x=date)) +
  geom_line(aes(y=ppm, color='Observed Values'), na.rm=TRUE) +
  geom_line(aes(y=fits, color='Fitted Values'), linetype='dashed', na.rm=TRUE) +
  geom_ribbon(aes(ymin=p_lwr, ymax=p_upr, fill='Prediction Interval'), alpha=1, na.rm=TRUE) +
  geom_ribbon(aes(ymin=c_lwr, ymax=c_upr, fill='Confidence Interval'), alpha=0.8, na.rm=TRUE) +
  geom_line(aes(y=pred, color='Point Forecasts'), lwd=0.6, na.rm=TRUE) +
  ggtitle(expression('Forecasted CO'[2]*' Monthly Means'), '2010-2020') +
  xlab('Date') +
  ylab('PPM') +
  scale_fill_manual(name="Shades", values=cols) +
  scale_color_manual(name="Lines", values=cols)